 speech summarization


Summarizing Speech: A Comprehensive Survey

arXiv.org Artificial Intelligence

Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization remains loosely defined. The field intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions. In doing so, we surface the ongoing challenges, such as the need for realistic evaluation benchmarks, multilingual datasets, and long-context handling.


Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning

arXiv.org Artificial Intelligence

Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.


Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques

arXiv.org Artificial Intelligence

Remote communication through video or audio conferences has become more popular than ever because of the worldwide pandemic. These events, therefore, have provoked the development of systems for automatic minuting of spoken language leading to AutoMin 2021 challenge. The following paper illustrates the results of the research that team MTS has carried out while participating in the Automatic Minutes challenge. In particular, in this paper we analyze existing approaches to text and speech summarization, propose an unsupervised summarization technique based on clustering and provide a pipeline that includes an adapted automatic speech recognition block able to run on ...

Description of the datasets used and of the designed approaches; Experiment procedure and results; Conclusion and further work. In this project we aim to achieve three main goals. First, we wish to analyze the existing pre-trained summarization models and compare their performance in summarization of manually transcribed audio recordings. Second, we propose a custom unsupervised approach for summarization of English texts. Last but not least, we adapt our summarization module for multichannel meeting audio recordings and made the pipeline opensource.
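As a rough illustration of what an unsupervised, clustering-based extractive summarizer can look like, the sketch below groups sentences greedily by word overlap and keeps one representative from each of the largest groups. The listing does not describe the team's actual algorithm, so everything here is an assumption made for illustration: the Jaccard similarity, the greedy assignment, the threshold value, and the function names `jaccard`, `cluster_sentences`, and `summarize_by_clusters`.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sentences(sentences: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Greedy clustering: a sentence joins the first cluster whose seed
    sentence is similar enough, otherwise it starts a new cluster.
    Returns clusters as lists of sentence indices."""
    token_sets = [set(s.lower().split()) for s in sentences]
    clusters: List[List[int]] = []
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            seed = token_sets[cluster[0]]
            if jaccard(tokens, seed) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def summarize_by_clusters(sentences: List[str], threshold: float = 0.3,
                          max_sentences: int = 2) -> List[str]:
    """Keep the seed sentence of each of the largest clusters,
    returned in their original order."""
    clusters = cluster_sentences(sentences, threshold)
    largest = sorted(clusters, key=len, reverse=True)[:max_sentences]
    picks = sorted(cluster[0] for cluster in largest)
    return [sentences[i] for i in picks]


if __name__ == "__main__":
    transcript = [
        "we need to ship the release on friday",
        "the release must ship on friday",
        "lunch was good",
        "ship the release friday please",
        "budget review is next month",
    ]
    print(summarize_by_clusters(transcript, max_sentences=1))
```

A real system would more likely cluster sentence embeddings than raw word sets. The greedy word-overlap version is used here only because it is deterministic and needs no external libraries.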


Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization?

arXiv.org Artificial Intelligence

Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording. In this paper, we examine whether summaries based on annotators listening to the recordings differ from those based on annotators reading transcripts. Using existing intrinsic evaluation based on human evaluation, automatic metrics, LLM-based evaluation, and a retrieval-based reference-free method, we find that summaries are indeed different based on the source modality, and that speech-based summaries are more factually consistent and information-selective than transcript-based summaries. Meanwhile, transcript-based summaries are impacted by recognition errors in the source, and expert-written summaries are more informative and reliable. We make all the collected data and analysis code public (https://github.com/cmu-mlsp/interview_humanssum) to facilitate the reproduction of our work and advance research in this area.


An End-to-End Speech Summarization Using Large Language Model

arXiv.org Artificial Intelligence

Abstractive Speech Summarization (SSum) aims to generate human-like text summaries from spoken content. It encounters difficulties in handling long speech input and capturing the intricate cross-modal mapping between long speech inputs and short text summaries. Research on large language models (LLMs) and multimodal information fusion has provided new insights for addressing these challenges. In this paper, we propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality and employs LLMs to generate text summaries directly from speech features. We adopt a multi-stage training approach that includes LLM-based ASR and Text Summarization (TSum) tasks as auxiliary tasks. ASR tasks are used to align feature spaces and enhance the LLM's ability to handle longer speech. Then, we utilize a curriculum learning strategy to facilitate the model's transition from TSum to SSum. Finally, our model achieves competitive performance on the How2 dataset.


Prompting Large Language Models with Audio for General-Purpose Speech Summarization

arXiv.org Artificial Intelligence

In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different ...

Our model is trained using the concept of modality invariance: the idea that, given certain semantic information in a prompt, the LLM should provide the same response regardless of the prompt's modality [12]. Specifically, we use an ASR dataset with paired speech-text data; while keeping the LLM weights frozen, we train the audio encoder to convert speech inputs into token representations that the LLM can interpret. Then, the end-to-end system is guided to produce the same output as when text is the input using next-token prediction loss. We additionally incorporate knowledge distillation using the response from the corresponding text input as the teacher model, utilizing feature and logit distillation losses to guide the model to produce more consistent responses from speech inputs.
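The training signal described above pairs a next-token prediction loss with a logit distillation loss, where the text-input response is the teacher and the speech-input response is the student. The sketch below shows one common way to write those two terms for a single token position. The exact loss forms, the temperature, the weight `alpha`, and the function names are assumptions, since the excerpt does not specify them. The feature distillation term is omitted.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(student_logits: List[float], target_index: int) -> float:
    """Cross-entropy of the speech-input (student) prediction against the
    token produced when text is the input."""
    return -math.log(softmax(student_logits)[target_index])


def logit_distillation_loss(student_logits: List[float],
                            teacher_logits: List[float],
                            temperature: float = 1.0) -> float:
    """KL(teacher || student) between the softened distributions.
    Assumed form: the source only says 'logit distillation'."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def combined_loss(student_logits: List[float], teacher_logits: List[float],
                  target_index: int, alpha: float = 0.5) -> float:
    """Next-token loss plus a weighted distillation term (alpha is hypothetical)."""
    return (next_token_loss(student_logits, target_index)
            + alpha * logit_distillation_loss(student_logits, teacher_logits))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # logits when the prompt is given as text
    student = [1.5, 0.7, -0.8]   # logits when the same prompt is given as speech
    print(combined_loss(student, teacher, target_index=0))
```

In the setup the excerpt describes, the LLM weights stay frozen, so gradients from a loss like this would update only the audio encoder.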


AugSumm: towards generalizable speech summarization using synthetic labels from large language model

arXiv.org Artificial Intelligence

Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary for every recording. Generating multiple human references would be ideal to better represent the distribution statistically, but is impractical because annotation is expensive. We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation. First, we explore prompting strategies to generate synthetic summaries from ChatGPT. We validate the quality of synthetic summaries using multiple metrics including human evaluation, where we find that summaries generated using AugSumm are perceived as more valid to humans. Second, we develop methods to utilize synthetic summaries in training and evaluation. Experiments on How2 demonstrate that pre-training on synthetic summaries and fine-tuning on GT summaries improves ROUGE-L by 1 point on both GT and AugSumm-based test sets. AugSumm summaries are available at https://github.com/Jungjee/AugSumm.
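To make the evaluation idea concrete, the sketch below computes a simplified ROUGE-L F1 score from the longest common subsequence of whitespace tokens, then scores one candidate against several references. It is not the official ROUGE implementation, and taking the best score over the references is an assumed aggregation rule rather than the protocol used in AugSumm. The function names are illustrative.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall over lowercased tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_rouge_l(candidate: str, references: List[str]) -> float:
    """Score against every reference and keep the best match (assumed rule)."""
    return max(rouge_l_f1(candidate, ref) for ref in references)


if __name__ == "__main__":
    refs = ["how to bake bread", "a guide to baking bread at home"]
    print(multi_reference_rouge_l("how to bake bread at home", refs))
```

With several references, a summary that matches any valid phrasing is not penalized for differing from one particular ground-truth summary. This is the motivation the abstract gives for moving beyond a single reference.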


BASS: Block-wise Adaptation for Speech Summarization

arXiv.org Artificial Intelligence

End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method that allows one to train summarization models on very long sequences in an incremental manner. Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block based on new acoustic information. We devise and test strategies to pass semantic context across the blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.
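The streaming view described above, where a hypothesis summary is updated after every block and context is passed between blocks, can be sketched as a simple loop. The code below is only a structural illustration: `split_into_blocks`, `blockwise_summarize`, and the word-counting `toy_step` are hypothetical stand-ins, not the BASS model or its context-passing strategies.

```python
from collections import Counter
from typing import Any, Callable, List, Optional, Tuple

# A step takes (block, carried_context) and returns (updated_hypothesis, new_context).
Step = Callable[[List[str], Optional[Any]], Tuple[str, Any]]


def split_into_blocks(frames: List[str], block_size: int) -> List[List[str]]:
    """Cut the input into consecutive blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [frames[i:i + block_size] for i in range(0, len(frames), block_size)]


def blockwise_summarize(frames: List[str], block_size: int, step: Step) -> str:
    """Process one block at a time, carrying context forward and
    replacing the hypothesis summary after each block."""
    context: Optional[Any] = None
    hypothesis = ""
    for block in split_into_blocks(frames, block_size):
        hypothesis, context = step(block, context)
    return hypothesis


def toy_step(block: List[str], context: Optional[Counter]) -> Tuple[str, Counter]:
    """Toy stand-in for a model: the carried context is a running word count,
    and the 'summary' is the two most frequent words seen so far."""
    counts = Counter() if context is None else context
    counts.update(block)
    top_words = [word for word, _ in counts.most_common(2)]
    return " ".join(top_words), counts


if __name__ == "__main__":
    frames = ["bread", "oven", "bread", "flour", "oven", "bread", "oven", "bread"]
    print(blockwise_summarize(frames, block_size=3, step=toy_step))
```

Because only one block and the carried context are needed at any step, memory use depends on the block size rather than the full input length. That property is what allows training on very long recordings without truncating them.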


Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

arXiv.org Artificial Intelligence

End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because it, in contrast to the conventional cascade approach, can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences. To overcome this drawback, we propose for the first time to integrate a pre-trained language model (LM), which is highly capable of generating natural sentences, into the E2E SSum decoder via transfer learning. In addition, to reduce the gap between the independently pre-trained encoder and decoder, we also propose to transfer the baseline E2E SSum encoder instead of the commonly used automatic speech recognition encoder. Experimental results show that the proposed model outperforms baseline and data-augmented models.


Leveraging Large Text Corpora for End-to-End Speech Summarization

arXiv.org Artificial Intelligence

End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large amount of paired data (i.e., speech and summary) is difficult, the training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first technique is to utilize a text-to-speech (TTS) system to generate synthesized speech, which is used for E2E SSum training with the text summary. The second is a TTS-free method that directly inputs a phoneme sequence instead of synthesized speech to the E2E SSum model. Experiments show that our proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms a previous state-of-the-art one by a large margin (i.e., METEOR score improvements of more than 6 points). To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.